Bank Loan Classification - Supervised Learning Project

Objective:

The goal of this classification task is to predict the likelihood that a liability customer will purchase a personal loan.

Executive Summary:

  As part of its customer acquisition efforts, Thera Bank wants to run a campaign to convince more of its current customers to accept personal loan offers. To improve targeting quality, the bank wants to identify the customers who are most likely to accept the offer. The dataset comes from a previous campaign on 5,000 customers, of whom 480 accepted (a successful conversion).

The metric used to evaluate model performance is the F1-Score. Although accuracy is useful, we prefer the F1-Score because the target class is imbalanced: it balances precision and recall, i.e., it penalizes both False Positives and False Negatives.
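To make the trade-off concrete, here is a minimal sketch of the F1-Score as the harmonic mean of precision and recall; the counts are hypothetical, not taken from the dataset.

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall.
def f1_score(tp, fp, fn):
    """Compute F1 from confusion-matrix counts (hypothetical values)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A majority-class predictor on a 90/10 split can reach ~90% accuracy
# while catching zero positives; F1 exposes that failure.
print(round(f1_score(tp=40, fp=10, fn=8), 3))  # → 0.816
```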

We obtained an F1-Score of approximately 0.99 and an accuracy of 0.99 for the best-performing model.

We go through the machine learning pipeline, starting with reading the dataset and exploring the data through plots and summaries. Then we preprocess the data, standardizing features and checking for missing values. Next, we build models to classify the data.

Finally, we evaluate the best models using the whole test dataset.

Attribute Types:


The ID variable does not add any useful information: there is no association between a customer's ID and loan uptake. We can remove this attribute for our modelling.

We confirm that there are no missing values (NAs), so we do not need to remove or impute any. If there were missing values, we would use simple value imputation or KNN imputation.

From a completeness point of view, the data looks great.
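A minimal sketch of this check and of the KNN imputation we would fall back on; the small frame below is a hypothetical stand-in for the bank data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame standing in for the bank data.
df = pd.DataFrame({"Age": [35, 40, np.nan, 50], "Income": [45, 55, 60, np.nan]})
print(df.isna().sum().sum())  # count of missing cells

# If NAs existed, KNN imputation would fill them from similar rows.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                       columns=df.columns)
print(imputed.isna().sum().sum())  # → 0
```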

Exploratory Data Analysis

Attribute Information:

Univariate Plots

The Age feature is normally distributed, with the majority of customers between 30 and 60 years of age. The describe statement above confirms this: the mean is almost equal to the median.
Experience is also roughly normally distributed, with most customers having 8 or more years of experience; here the mean equals the median. However, Experience contains negative values. These are likely data-entry errors, as negative years of experience are not possible.
 
CCAvg is a positively skewed variable, with average credit card spending between 0K and 10K.
The majority of customers have an income between 45K and 55K.
ZIP Code is negatively skewed; the values come from a single region.
The most frequent Mortgage value is 0.

Most customers do not have a Securities Account, CD Account, or Credit Card.
Relatively more customers use internet banking facilities.
A larger share of customers are undergraduates and have a family size of one.

Bivariate Plots

Personal Loan does not vary with Age or Experience.
ZIP Code seems irrelevant too.
Income has a strong effect on Personal Loan: customers with high income are more likely to have a personal loan.
CCAvg also shows a good relationship with Personal Loan: customers with a personal loan have high average monthly credit card spending.
Customers with high mortgages are more likely to have opted for a personal loan.

Family size does not have a strong impact on personal loan uptake, although families of size 3 appear slightly more likely to take a loan. This association might be useful when planning future campaigns.

The customers' education level does impact whether or not they have a personal loan: customers with higher degrees appear more likely to have one.

Of the customers who have a Securities account with the bank, many do not seem to have a Personal Loan.

Of the customers who have a CD account with the bank, many do seem to have a Personal Loan.

Using Online banking doesn't seem to impact the chance of having a personal loan.

Having a credit card seems to impact the chance of having a personal loan.

Correlation heatmap

If there is multicollinearity, we cannot understand how any one variable influences the target; there is no way to estimate the separate influence of each variable.
Age and Experience are highly positively correlated with each other, so one of these attributes should be removed before modelling.
Income and CC_Avg (average credit card spending) also seem to be positively correlated with each other.
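The Age/Experience relationship can be reproduced with a correlation matrix; this sketch uses synthetic data in which Experience tracks Age closely, mimicking the pattern seen in the heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(23, 65, size=200)
# Experience tracks Age almost exactly, as in the dataset.
experience = age - 22 + rng.normal(0, 1, size=200)

df = pd.DataFrame({"Age": age, "Experience": experience})
corr = df.corr()  # seaborn's heatmap would visualize this matrix
print(corr.loc["Age", "Experience"])
```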

As only 9.6% of customers responded positively to the previous campaign, the target attribute is heavily imbalanced. So we may need to employ techniques such as upsampling, downsampling, or SMOTE for the classifier to learn properly.

Income, CD Account, Facilities, CC Avg, Education, Family, Mortgage, and Securities Account seem to be strong predictors of the target variable, while Age, Experience, and ZIP Code seem to have little bearing on it and could be removed during Feature Selection before modelling the data.

Data Preprocessing

Experience has 52 entries with negative values, which appear to be data-entry errors. Age and Experience are also highly positively correlated, so we drop the Experience column, since models cannot learn properly when highly correlated variables are present.
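A minimal sketch of this step with a hypothetical slice of the data; in the full dataset 52 rows have negative Experience.

```python
import pandas as pd

# Hypothetical slice of the data; negative Experience is a data-entry error.
df = pd.DataFrame({"Age": [25, 45, 30],
                   "Experience": [-2, 20, 5],
                   "Income": [40, 90, 50]})
print((df["Experience"] < 0).sum())  # negative-value count in this slice

# Experience is nearly collinear with Age, so we drop it rather than repair it.
df = df.drop(columns=["Experience"])
print(list(df.columns))  # → ['Age', 'Income']
```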

Encoding Categorical attributes
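One common way to encode a categorical attribute such as Education is one-hot encoding; this is a sketch with hypothetical values (the 1/2/3 level labels are assumptions about the coding scheme).

```python
import pandas as pd

# Education is categorical (e.g., 1=Undergrad, 2=Graduate, 3=Advanced).
df = pd.DataFrame({"Education": [1, 2, 3, 1], "Income": [40, 90, 120, 55]})

# One-hot encode it so models do not treat the levels as ordinal magnitudes.
encoded = pd.get_dummies(df, columns=["Education"], prefix="Edu")
print(list(encoded.columns))  # → ['Income', 'Edu_1', 'Edu_2', 'Edu_3']
```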

Over Sampling SMOTE
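In practice imbalanced-learn's SMOTE would be used here; purely to illustrate the idea, this is a minimal NumPy sketch of SMOTE-style interpolation between a minority sample and one of its nearest neighbors (the data and function name are hypothetical).

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE-style oversampling: each synthetic point is an
    interpolation between a random minority sample and one of its
    k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to sample i
        neighbors = np.argsort(d)[1:k + 1]             # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                             # factor in [0, 1)
        new_points.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_points)

# Four minority-class points; generate four synthetic ones.
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
synthetic = smote_like(X_minority, n_new=4)
print(synthetic.shape)  # → (4, 2)
```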

Split Training and Testing Datasets

Scaling
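The split and scaling steps can be sketched together; the key point is that the scaler is fitted on the training split only, so no test-set information leaks into preprocessing. The data below is a synthetic stand-in.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(50, 15, size=(100, 3))   # stand-in for Income, CCAvg, etc.
y = rng.integers(0, 2, size=100)

# Stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Fit the scaler on the training split only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0).round(6))  # ~0 per feature after scaling
```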

Model Building

Logistic Regression

K-Nearest Neighbors Classifier

Naïve Bayes Classifier

Support Vector Classifier

Decision Tree Classifier

Random Forest Classifier
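Models like those listed above can be fitted and compared on a held-out split. This sketch uses a synthetic imbalanced dataset, not the actual bank data, and a representative subset of the model list.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the bank dataset (~10% positives).
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(f1_score(y_te, model.predict(X_te)), 3))
```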

Hence, we can confirm that the most important features for predicting the target are indeed Income, CC_Avg, Education, and Family, followed by the moderately important Age, Mortgage, Online banking, CD_Account, and Securities_Account.
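The conclusion above comes from the Random Forest's feature importances; a minimal sketch of reading them off a fitted model, with synthetic data and hypothetical feature names mirroring the dataset's columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names mirroring the dataset's columns.
cols = ["Income", "CC_Avg", "Education", "Family", "Age", "Mortgage"]
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=1)
rf = RandomForestClassifier(random_state=1).fit(X, y)

# Importances sum to 1; higher means the feature drives more splits.
importances = pd.Series(rf.feature_importances_,
                        index=cols).sort_values(ascending=False)
print(importances)
```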


Predict the Likelihood of a New Customer Buying a Personal Loan

We predict the likelihood of personal_loan == 1 (i.e., taking a personal loan) for the sample data points above, using the models fitted on the dataset.
We could also set our own probability threshold for predicting the output as 1 and make custom predictions; the choice of threshold depends on whether we care more about avoiding False Positives or False Negatives.
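Custom thresholding works on the predicted probabilities rather than the default 0.5 cut-off; a sketch with synthetic data and a logistic regression stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]  # P(personal_loan == 1)

# Lowering the threshold below 0.5 trades False Negatives for False Positives:
# more customers are flagged as likely buyers.
threshold = 0.3
custom_pred = (proba >= threshold).astype(int)
print(custom_pred.sum(), (proba >= 0.5).sum())
```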

Model Evaluation

We chose the F1-Score as the metric to judge our models because we are concerned with the positive class and the classes are imbalanced; the F1-Score accounts for both False Positives and False Negatives.
Looking at the confusion matrices for all models, the Random Forest classifier makes only 13 False Negatives and 8 False Positives.
Hence, the Random Forest classifier is the best model, with an F1-Score of 0.991 and an accuracy of 0.992.

The Random Forest classifier is an ensemble model that combines multiple decision trees, which makes it less prone to overfitting and generally robust in classification tasks. Its performance comes from being an ensemble that also makes use of the most important predictors.
The K-Nearest Neighbors classifier also performs well on this dataset because it learns from similar data points: points with similar attribute values tend to have similar target responses.